Abstract

Experimental verification has been the method of choice for assessing the
stability of a multi-agent reinforcement learning (MARL) algorithm as the
number of agents grows and theoretical analysis becomes prohibitively
complex. For cooperative agents, where the ultimate goal is to optimize some
global metric, stability is usually verified by observing the evolution of
the global performance metric over time. If the global metric improves and
eventually stabilizes, this is considered a reasonable verification of the
system's stability.

The main contribution of this note is establishing the need for better 
experimental frameworks and measures to assess the stability of large-scale 
adaptive cooperative systems. We show an experimental case study where 
the stability of the global performance metric can be rather deceiving, hiding 
an underlying instability in the system that later leads to a significant drop 
in performance. We then propose an alternative metric that relies on agents' 
local policies and show, experimentally, that our proposed metric is more 
effective (than the traditional global performance metric) in exposing the 
instability of MARL algorithms. 
1. Introduction



The term convergence, in the reinforcement learning context, refers to the
stability of the learning process (and the underlying model) over time. As with
single-agent reinforcement learning algorithms (such as Q-learning ([]])),
the convergence of a multi-agent reinforcement learning (MARL) algorithm is
an important property that has received considerable attention (@;S0;lll).
However, proving the convergence of a MARL algorithm via theoretical analysis
is significantly more challenging than proving convergence in the single-agent
case. The presence of other agents that are also learning renders the
environment non-stationary, thereby violating a foundational assumption of
single-agent learning. In fact, proving the convergence of a MARL algorithm
even in 2-player-2-action single-stage games (arguably the simplest class of
multi-agent domains) has been challenging (0; 0; [El).

As a consequence, experimental verification is usually the method of
choice as the number of agents grows and theoretical analysis becomes
prohibitively complex. For cooperative agents, researchers have typically
verified the stability of a MARL algorithm by observing the evolution of some
global performance metric over time (0; 0; 0; 0; 0). This is not surprising,
since the ultimate goal of a cooperative system is to optimize some global
metric. Examples of global performance metrics include the percentage of
packets delivered in routing problems (jlOl ), the average turnaround time
of tasks in task allocation problems ((HI), or the average reward (received by
agents) in general (jsl).

If the global metric improves over time and eventually appears to stabilize, 
it is usually considered a reasonable verification of convergence (0; Q; 0; @; 0). 
Even if the underlying agent policies are not stable, one could argue that at 
the end, global performance is all that matters in a cooperative system. 

This paper challenges the above (widely used) practice and establishes
the need for better experimental frameworks and measures for assessing the
stability of large-scale cooperative systems. We show an experimental case
study where the stability of the global performance metric can hide an
underlying instability in the system. This hidden instability later leads to a
significant drop in the global performance metric itself. We propose an
alternative measure that relies on agents' local policies: the policy entropy.
We experimentally show that the proposed metric is more effective than the
traditional global performance metric in exposing the instability of MARL
algorithms in large-scale multi-agent systems.
The paper is organized as follows. Section 2 describes the case study we
use throughout the paper. Section 3 reviews MARL algorithms
(with particular focus on WPL and GIGA-WoLF, the two algorithms we use
in our experimental evaluation). Section 4 presents our initial experimental
results, where the global performance metric leads to the (misleading)
conclusion that a MARL algorithm converges. Section 5 presents our proposed
measure and illustrates how it is used to expose the hidden instability of a
MARL algorithm. We conclude in Section 6.

2. Case Study: Distributed Task Allocation Problem (DTAP) 

We use a simplified version of the distributed task allocation domain
(DTAP) (0), where the goal of the system is to assign tasks to agents such
that the service time of each task is minimized. For illustration, consider the
example scenario depicted in Figure 1. Agent A0 receives task T1, which
can be executed by any of the agents A0, A1, A2, A3, and A4. All agents
other than agent A4 are overloaded, and therefore the best option for agent
A0 is to forward task T1 to agent A2, which in turn forwards the task to its
left neighbor (A5), and so on until task T1 reaches agent A4. Although agent
A0 does not know that A4 is under-loaded (because agent A0 interacts only with
its immediate neighbors), agent A0 should eventually learn (through experience
and interaction with its neighbors) that sending task T1 to agent A2 is the
best action, without even knowing that agent A4 exists.




Figure 1: Task allocation using a network of agents. 

The DTAP domain has an essential property that appears in many real-world
problems yet is not captured by most of the domains that have been used to
analyze MARL algorithms experimentally: communication delay. The effect
of an action does not appear immediately because it is communicated via
messages, and messages take time to route. Not only is the reward delayed,
but so is any change in the system's state. A consequence of communication
delay is partial observability: an agent cannot observe the full system state
(the queues at every other agent, messages on links and in queues, etc.).

Each time unit, agents make decisions regarding all task requests received 
during this time unit. For each task, the agent can either execute the task 
locally or send the task to a neighboring agent. If an agent decides to execute 
the task locally, the agent adds the task to its local queue, where tasks are 
executed on a first-come, first-served basis, with unlimited queue length.

Each agent has a physical location. Communication delay between two
agents is proportional to the Euclidean distance between them, one time unit
per distance unit. Agents interact via two types of messages. A REQUEST
message (i, j, T) indicates a request sent from agent i to agent j requesting the
execution of task T. An UPDATE message (i, j, T, R) indicates feedback
(a reward signal) from agent i to agent j that task T took R time steps to
complete (the time steps are counted from the time agent i received T's
request).
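The two message types and the delay rule can be written down as a minimal Python sketch. The class and field names (`Request`, `Update`, `sender`, `receiver`, `duration`) and the function `communication_delay` are hypothetical names chosen for illustration; the simulator's actual data structures are not described in the text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """REQUEST (i, j, T): agent i asks agent j to execute task T."""
    sender: int      # agent i
    receiver: int    # agent j
    task_id: int     # task T


@dataclass(frozen=True)
class Update:
    """UPDATE (i, j, T, R): agent i reports to agent j that task T took R time steps."""
    sender: int      # agent i
    receiver: int    # agent j
    task_id: int     # task T
    duration: float  # R, counted from the time agent i received T's request


def communication_delay(pos_a, pos_b):
    """Delay in time units: one time unit per unit of Euclidean distance."""
    return math.dist(pos_a, pos_b)
```

For example, two agents located at (0, 0) and (3, 4) would be five time units apart under this rule.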

The main goal of DTAP is to reduce the total service time, averaged over
tasks,

$$ATST = \frac{\sum_{T \in \mathbf{T}^{\tau}} TST(T)}{|\mathbf{T}^{\tau}|},$$

where $\mathbf{T}^{\tau}$ is the set of task requests received during a time
period $\tau$ and $TST(T)$ is the total time a task $T$ spends in the
system. The $TST(T)$ time consists of the time for routing a task request
through the network, the time the task request spends in the local queue,
and the time of actually executing the task.
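As a small worked example of the definition above, the following Python sketch computes TST for each task from its three components and then averages over the tasks received in one period. The function names and the sample numbers are illustrative assumptions, not values from the experiments.

```python
def total_service_time(routing_time, queue_time, execution_time):
    """TST(T): routing time plus queueing time plus execution time of one task."""
    return routing_time + queue_time + execution_time


def average_total_service_time(service_times):
    """ATST: mean TST over the tasks received during one time period."""
    if not service_times:
        raise ValueError("ATST is undefined for an empty set of tasks")
    return sum(service_times) / len(service_times)


# Two hypothetical tasks: (routing, queueing, execution) in time units.
tasks = [(4, 10, 6), (2, 0, 8)]
tst_values = [total_service_time(*t) for t in tasks]   # [20, 10]
atst = average_total_service_time(tst_values)          # 15.0
```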

Although the simulator has different underlying states, we
deliberately made agents oblivious to these states. The only feedback an
agent gets is its own reward. This simplifies the agent's decision problem and
re-emphasises partial observability: agents collectively learn a joint policy
that makes a good compromise over the different unobserved states (because
the agents cannot distinguish between these states).
3. Multiagent Reinforcement Learning

The experimental results in the following section focus on two
gradient-ascent MARL algorithms: GIGA-WoLF and WPL. We chose these
two algorithms because they allow agents to learn a stochastic policy based on
the expected reward gradient. Both algorithms were also shown to converge
in benchmark two-player-two-action games as well as some larger games.
The specifics of WPL and GIGA-WoLF (such as their update equations, the
underlying intuition, their differences and similarities) are neither relevant
to the purpose of this paper nor needed to follow our analysis in the later
sections. Nevertheless, for completeness, we briefly state below the
equations for updating the policy for the two algorithms. Further details
regarding the two algorithms can be found elsewhere (0; 0).

An agent $i$ using WPL updates its policy $\pi_i$ according to the following
equations:

$$\forall j \in \mathit{neighbors}(i): \quad \Delta\pi_i^{t+1}(j) = \eta \, \frac{\partial V_i^t(\pi_i)}{\partial \pi_i^t(j)} \cdot \begin{cases} \pi_i^t(j) & \text{if } \frac{\partial V_i^t(\pi_i)}{\partial \pi_i^t(j)} < 0 \\ 1 - \pi_i^t(j) & \text{otherwise} \end{cases}$$

$$\pi_i^{t+1} = \mathit{projection}(\pi_i^t + \Delta\pi_i^{t+1})$$

where $\eta$ is a small learning constant and $V_i(\pi_i)$ is the expected reward
agent $i$ would get if it interacts with its neighbors according to policy $\pi_i$.
The projection function ensures that after adding the gradient $\Delta\pi_i$ to the
policy, the resulting policy is still valid.

An agent $i$ using GIGA-WoLF updates its policy $\pi_i$ according to the
following equations:

$$\hat{\pi}_i^{t+1} = \mathit{projection}(\pi_i^t + \eta V_i^t(\pi_i))$$

$$z_i^{t+1} = \mathit{projection}(z_i^t + \eta V_i^t(\pi_i)/3)$$

* A large number of MARL algorithms have been proposed that vary in their
underlying assumptions and target domains (jllf ). MARL algorithms that can only learn a
deterministic policy (such as Q-learning {]])) are not suitable for the DTAP domain. For
example, even if two neighbors have practically the same load, Q-learning will assign all
incoming requests to one of the neighbors until feedback is received later indicating a
change in the load. On the other hand, an agent using a gradient-ascent MARL algorithm
has the ability to adjust its policy to a non-deterministic (or stochastic) distribution (0).
Q-learning was successfully used in the packet routing domain |5[l2h, where load balancing
is not the main concern (the main objective is routing a packet from a particular source
to a particular destination).
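The two update rules of this section can be sketched in Python as follows. Several choices here are assumptions rather than details given in the text: the projection is implemented as a Euclidean projection onto the probability simplex, and the function names (`project_to_simplex`, `wpl_update`, `giga_wolf_update`) are hypothetical. The text shows only the first two GIGA-WoLF steps; the step size `delta` and the final interpolation below follow the published description of GIGA-WoLF and should be read as an assumption about what the full rule looks like here.

```python
import math


def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (assumed projection)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0:
            theta = candidate  # keep the threshold of the last valid k
    return [max(x - theta, 0.0) for x in v]


def wpl_update(policy, gradient, eta):
    """One WPL step: scale each gradient component by pi(j) or 1 - pi(j)."""
    delta = []
    for p, g in zip(policy, gradient):
        weight = p if g < 0 else 1.0 - p
        delta.append(eta * g * weight)
    return project_to_simplex([p + d for p, d in zip(policy, delta)])


def giga_wolf_update(policy, z, value, eta):
    """One GIGA-WoLF step; returns the new policy and the new z vector."""
    pi_hat = project_to_simplex([p + eta * v for p, v in zip(policy, value)])
    z_next = project_to_simplex([zi + eta * v / 3.0 for zi, v in zip(z, value)])
    # Assumed remaining steps (not shown in the text): limit the move towards z.
    denominator = math.dist(z_next, pi_hat)
    delta = 1.0 if denominator == 0.0 else min(1.0, math.dist(z_next, z) / denominator)
    new_policy = [ph + delta * (zn - ph) for ph, zn in zip(pi_hat, z_next)]
    return new_policy, z_next
```

Both functions return a valid probability distribution over the neighbors, which is the property the projection is meant to guarantee.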

4. Stability Under the Global Metric Results 

We have evaluated the performance of WPL and GIGA-WoLF using the
following setting. 100 agents are organized in a 10x10 grid. Communication
delay between two adjacent agents is two time units. Tasks arrive at the 4x4
sub-grid at the center at a rate of 0.5 tasks/time unit. All agents can execute
tasks at a rate of 0.1 tasks/time unit (both task arrival and service durations
follow an exponential distribution). Figure 2 illustrates the setting.
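The layout of this setting can be sketched as a small Python helper. The function name `build_grid`, the dictionary keys, the row-major agent numbering, and the four-neighbor connectivity are assumptions made for illustration. Because delay is one time unit per distance unit, a delay of two time units between adjacent agents corresponds to a spacing of two distance units, which the sketch uses for the positions.

```python
def build_grid(grid_size=10, center_size=4, link_delay=2.0):
    """Return a dict of agents laid out on a square grid (hypothetical structure)."""
    offset = (grid_size - center_size) // 2  # first row/column of the central sub-grid
    agents = {}
    for row in range(grid_size):
        for col in range(grid_size):
            agent_id = row * grid_size + col
            # Assumed connectivity: up, down, left, right.
            neighbors = [
                r * grid_size + c
                for r, c in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1))
                if 0 <= r < grid_size and 0 <= c < grid_size
            ]
            in_center = (offset <= row < offset + center_size
                         and offset <= col < offset + center_size)
            agents[agent_id] = {
                "position": (col * link_delay, row * link_delay),
                "neighbors": neighbors,
                "receives_tasks": in_center,
            }
    return agents


agents = build_grid()
task_receivers = [a for a, info in agents.items() if info["receives_tasks"]]
```

With the default arguments this yields 100 agents, 16 of which receive tasks from the environment.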

Figure 3 plots the global performance (measured in terms of ATST) of the
two multi-agent learning algorithms in the DTAP domain. Judging by the
ATST plot alone, it is relatively safe to conclude that WPL converges quickly
while GIGA-WoLF converges after about 75,000 time steps. The following section
presents the measure we used to discover that the stability of
GIGA-WoLF is actually spurious.

5. Verifying Stability Using Policy Entropy 

Ideally, we would want to visualize and analyze the evolution of all learning
parameters, including the action values and policies of all agents. However,
going into such detail is only possible for a small number of agents. As the
number of agents increases, one needs aggregate measures that summarize the
system's behavior yet can reflect the stability of the system's dynamics.

We propose using a simple measure that summarizes an agent's policy
into a single number: the policy entropy, $H(\pi_x)$, for a particular agent $x$:

$$H(\pi_x) = -\sum_{y \in \mathit{neighbors}(x)} \pi_x(y) \lg \pi_x(y)$$



The simulator is available online at http://www.cs.umass.edu/~shario/dtap.html




Figure 2: The simulation setting for the DTAP domain. Only the 16 nodes at the 
center receive tasks from the environment. A node's diameter reflects its local 
queue length. 




Figure 3: Comparing the average total service time for 200,000 time steps of the
DTAP problem for WPL and GIGA-WoLF. 
The function $\pi_x$ is the policy of agent $x$ ($\pi_x(y)$ is the probability that
agent $x$ interacts with its neighbor $y$). In the case of gradient ascent, the
policy is explicitly learned. For deterministic learners (such as Q-learning),
the effective policy can be estimated by counting the number of times each
neighbor is chosen.
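A minimal Python sketch of the policy entropy follows, together with the count-based estimate of an effective policy for deterministic learners. It assumes that lg denotes the base-2 logarithm and uses the usual convention that a zero-probability action contributes nothing to the sum; the function names are hypothetical.

```python
import math


def policy_entropy(probabilities):
    """H(pi_x) = -sum_y pi_x(y) * lg pi_x(y), with 0 * lg 0 taken as 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0.0)


def policy_from_counts(counts):
    """Estimate an effective policy from how often each neighbor was chosen."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no choices recorded")
    return {neighbor: c / total for neighbor, c in counts.items()}


# A learned stochastic policy over four neighbors (uniform, so 2 bits of entropy).
uniform_entropy = policy_entropy([0.25, 0.25, 0.25, 0.25])

# A deterministic learner that picked neighbor 7 three times and neighbor 9 once.
effective = policy_from_counts({7: 3, 9: 1})
effective_entropy = policy_entropy(effective.values())
```

A policy that always picks the same neighbor has entropy zero, and a uniform policy over n neighbors has the maximum entropy lg n, so a falling entropy indicates that an agent is concentrating its choices on fewer neighbors.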

Figure 4 plots the average policy entropy and the associated standard
deviation, over the 100 agents, against time. Agent policies under WPL do 
converge but the policies under GIGA-WoLF have not converged yet. The 
policy entropy is still decreasing, which suggests that GIGA-WoLF is still 
adapting. 
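The aggregation over agents, and one simple way to check whether the average entropy is still drifting, can be sketched in Python as below. The least-squares slope test is an illustrative assumption and is not the procedure used in the paper, which relies on inspecting the plot; the use of the population standard deviation is likewise an assumption, and the function names are hypothetical.

```python
import statistics


def entropy_summary(entropies):
    """Mean and (population) standard deviation of policy entropy over agents."""
    return statistics.mean(entropies), statistics.pstdev(entropies)


def trend_slope(times, values):
    """Least-squares slope of values against times."""
    mean_t = statistics.mean(times)
    mean_v = statistics.mean(values)
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    denominator = sum((t - mean_t) ** 2 for t in times)
    return numerator / denominator


# Hypothetical average-entropy samples taken at four successive checkpoints.
checkpoints = [0, 1, 2, 3]
falling = [4.0, 3.0, 2.0, 1.0]
flat = [1.5, 1.5, 1.5, 1.5]

falling_slope = trend_slope(checkpoints, falling)  # clearly negative: still adapting
flat_slope = trend_slope(checkpoints, flat)        # zero: no drift in this window
```

A slope that stays near zero over a long window is only weak evidence of stability; how long that window must be is exactly the open question raised in the conclusion.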







Figure 4: The policy entropy of WPL and GIGA-WoLF for 200,000 time steps. 

This prompted us to rerun the simulator, this time allowing it
to run for 600,000 time steps instead of just 200,000. To our
surprise, the global performance metric (the ATST in this case) of GIGA-WoLF
slowly starts to diverge after 250,000 time steps, while the corresponding
policy entropy continues to decrease. WPL's policy entropy remains stable,
as does its global performance metric.

Figure 5: Comparing the average total service time for 600,000 time steps of the
DTAP problem for WPL and GIGA-WoLF.

Figure 6: The average policy entropy of WPL and GIGA-WoLF for 600,000 time
steps.

More in-depth analysis is needed in order to fully understand the dynamics
of GIGA-WoLF and WPL in the DTAP domain and in large-scale
systems in general. However, this is beyond the scope of this research note
and should not distract from the main point we are trying to make: the
common practice of using a global performance metric to verify the stability
of a MARL algorithm is not reliable and can be misleading.

6. Conclusion and Future Work

The main contribution of this paper is showing that using a global performance
metric to verify the stability (or even the usefulness) of a MARL
algorithm is not a reliable methodology. In particular, we present a case study
of 100 agents where the global performance metric can hide an underlying
instability in the system that later leads to a significant drop in performance.
We propose a measure that successfully exposes such instability.

One of the issues indirectly raised by this paper is how long a
performance metric must remain stable before one can conclude that the
underlying MARL algorithm is stable. Currently, no theoretical framework
addresses this question; we believe such a framework to be an essential
requirement for adopting MARL in practical large-scale applications.